Abstract
This paper presents a complete statistical analysis of Airbnb data from six major European cities. We start with an exploratory analysis of our dataset, in which we try to identify both the most relevant components that drive prices and possible differences between the cities. We then evaluate different statistical learning models that predict prices from a set of instrumental variables. Finally, we generate clusters, both from the whole dataset and from subsets related to single cities, to better understand the composition of the dataset.
We find that the most important drivers of price (apart from the city in which the property is located) are the type of room, i.e. whether it is shared or private, and the number of guests the property can accommodate: the higher the number, the lower the per-person price. These main variables are followed by others such as the rating of the property, the number of bathrooms and the presence of air conditioning.
We estimate a simple pruned tree, a Random Forest and a boosted tree model, which achieve a Root Mean Squared Error ranging from about 80 down to 73.
Finally, we try to cluster our data, but we find that the clusters have no actual geographical interpretation.
Airbnb is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. It was born in 2007 and has since grown to 4 million Hosts who have welcomed more than 900 million guest arrivals in almost every country across the globe. The app works like most other booking apps: clients simply select the city they want to visit and the dates of the trip, and they receive a list of possible locations.
The main feature that makes Airbnb unique is that on the website you can find not only hotels and apartments, but also single rooms rented out for a few days by private Hosts.
In this kind of environment Hosts are pushed to compete in an almost-free market, where they have to set the right price for their properties in order to stay competitive and win guests over. We can assume that a good percentage of private Hosts have no experience in either Marketing or Real Estate, hence the definition of the right price can become an entry barrier that stops most of them from ever trying to compete.
Our goal in this paper is to understand which components drive prices and to construct a model that can help both Hosts define the right price for their properties and Guests check whether a location is fairly priced.
We start our analysis by presenting our dataset. Before diving into the actual variables, we present the six different cities and the related number of records:
For a better understanding, we split the variables into five different groups, presented below.
The main goal of this analysis is to find what drives prices; the main issue we need to address before diving in is: how do we define Price?
We observe that there are several variables related to prices and fees, namely:
For the scope of this paper we define price as the per-person price of a seven-night stay at the property, computed as:
# Compute new price and remove useless columns
df = df %>% mutate(Price = (7*Price + Cleaning.Fee)/Accommodates) %>% select(-Cleaning.Fee)
We now present our dataset in a more sophisticated way, focusing on the relationships between the different variables that we collected and the prices.
We start by showing how prices differ by city:
We can see from the two graphs above that 4 of the 6 cities are quite similar (Barcelona, Rome, Wien and Berlin), with an average seven-night per-person price of about 160€. The two “outliers” are London, with a value of 220€, and Amsterdam, which unexpectedly shows a price of almost 350€.
Because we collected a sample dataset of fewer than 10k observations, we know that our analysis cannot establish that Amsterdam is overall the most expensive city; this is because in the investigation above we do not control for other factors, such as accommodation and service features.
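As a minimal sketch of how such a per-city comparison can be computed (the data frame here is a toy stand-in for our real sample, with made-up prices):

```r
# Per-city average of the adjusted price; toy stand-in for the real df
library(dplyr)

df <- data.frame(
  City  = c("Amsterdam", "Amsterdam", "London", "London", "Rome", "Rome"),
  Price = c(340, 360, 210, 230, 150, 170)
)

df %>%
  group_by(City) %>%
  summarise(Mean.Price = mean(Price), Listings = n()) %>%
  arrange(desc(Mean.Price))
```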
Proceeding in the order we used in the Data section, we will now try to assess whether the different groups of variables can be considered drivers of price.
We will now show the qualitative and quantitative correlations between the variables of our dataset:
From the graph above we can see that there is a high correlation between the variables which describe the actual property; this is basically due to the fact that a high number of beds goes together with more bathrooms for the guests, which in turn goes with high square footage, and so on. This first interpretation is pretty basic and we will not dive deeper into it.
As spotted before, we can also see the small correlation between Host.Since and Number.of.Reviews; this can be explained by the fact that a property listed for a long time gets more guests and consequently more reviews.
From the graph above we can start drawing some key findings:
After this exploratory analysis, which helped us better visualize and refine our dataset, we move to a supervised learning approach. The goal here is to build a model that can accurately predict the price of a property given its features; this kind of model could be implemented in different use cases, such as:
Given the composition of our dataset, which presents both quantitative and qualitative variables, we decide to implement a tree-based algorithm. This kind of algorithm is definitely effective on hybrid datasets and provides great interpretability; in this way we can both get an efficient model and understand which are the main variables that drive Airbnb prices.
We start by estimating a tree over the whole dataset (80-20 train-test split).
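A minimal, runnable sketch of this fitting step is shown below; `df_model` here is a toy stand-in for our cleaned dataset, and the seed and sampling call are illustrative:

```r
# Toy stand-in for df_model; the real dataset has far more rows and columns
library(rpart)

set.seed(1)
df_model <- data.frame(
  Price        = round(rexp(200, rate = 1 / 160)),
  City         = factor(sample(c("Amsterdam", "London", "Rome"), 200, replace = TRUE)),
  Room.Type    = factor(sample(c("Entire home/apt", "Private room"), 200, replace = TRUE)),
  Accommodates = sample(1:6, 200, replace = TRUE)
)

# 80-20 train-test split
train.full <- sample(nrow(df_model), 0.8 * nrow(df_model))

# Regression tree on the training portion
tree.full <- rpart(Price ~ ., data = df_model, subset = train.full)
printcp(tree.full)   # cost-complexity table, analogous to the output below
```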
##
## Regression tree:
## rpart(formula = Price ~ ., data = df_model, subset = train.full)
##
## Variables actually used in tree construction:
## [1] Accommodates City Room.Type
##
## Root node error: 113060445/7780 = 14532
##
## n= 7780
##
## CP nsplit rel error xerror xstd
## 1 0.296703 0 1.00000 1.00030 0.037903
## 2 0.025776 1 0.70330 0.70400 0.032991
## 3 0.021319 4 0.62597 0.62826 0.031799
## 4 0.010646 6 0.58333 0.58750 0.031629
## 5 0.010373 7 0.57269 0.57253 0.031530
## 6 0.010000 9 0.55194 0.57041 0.031993
From the tree structure we can easily identify the main drivers of our price variable:
We now proceed to predict the values for our test set and evaluate them.
## [1] "Root Mean Squared Error: 80.8827730952067"
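The metric we report is the standard root mean squared error; a minimal helper (applied here to toy vectors, not our actual predictions) is:

```r
# Root mean squared error between observed and predicted prices
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy example: three observed prices against three predictions
rmse(c(100, 200, 300), c(110, 190, 320))   # sqrt(mean(c(100, 100, 400)))
```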
We now proceed to estimate different trees for the different cities, to see if there is any difference in the nodes:
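A sketch of this step, with a toy stand-in for `df_model`: one tree per city, with City dropped as a predictor since it is constant within each subset:

```r
# Fit one regression tree per city; toy data stands in for df_model
library(rpart)

set.seed(1)
df_model <- data.frame(
  Price        = round(rexp(300, rate = 1 / 160)),
  City         = factor(sample(c("Amsterdam", "London", "Rome"), 300, replace = TRUE)),
  Accommodates = sample(1:6, 300, replace = TRUE)
)

trees.by.city <- lapply(split(df_model, df_model$City), function(d) {
  rpart(Price ~ . - City, data = d)   # City is constant within each subset
})
names(trees.by.city)   # one fitted tree per city
```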
From the trees above we can see that the six different datasets construct trees that differ. The nodes that are always present in all six trees are, unsurprisingly, Room.Type and Accommodates, which, as seen above, seem to be the main drivers of price. The other variables that appear, as expected, are mainly:
Finally, we try to construct a more robust model, in which we aim for more accurate results at the expense of some interpretability.
We start by evaluating a Random Forest over our full dataset.
##
## Call:
## randomForest(formula = Price ~ ., data = df_model, subset = train.full)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 8
##
## Mean of squared residuals: 6300.264
## % Var explained: 56.65
## [1] "Root Mean Squared Error: 73.1557067944934"
We can see that this kind of model explains just 56% of the variability of our dataset, and it returns a Root Mean Squared Error of 73.16 (a major improvement over the 80.88 of the single-tree model).
We proceed by applying boosting.
## var rel.inf
## City City 44.244017976
## Accommodates Accommodates 13.864894540
## Room.Type Room.Type 10.122921249
## Review.Scores.Rating Review.Scores.Rating 5.231249883
## Security.Deposit Security.Deposit 4.487169469
## Host.Since Host.Since 4.088197021
## Host.Response.Rate Host.Response.Rate 2.995677382
## Bedrooms Bedrooms 2.704422681
## Property.Type Property.Type 2.619921511
## Bathrooms Bathrooms 2.509292531
## Number.of.Reviews Number.of.Reviews 1.928825245
## Air_conditioning Air_conditioning 1.739716127
## Cancellation.Policy Cancellation.Policy 0.755789815
## Maximum.Nights Maximum.Nights 0.710782791
## Experiences.Offered Experiences.Offered 0.576914834
## Host.Response.Time Host.Response.Time 0.399784217
## Beds Beds 0.366973051
## Instant_Bookable Instant_Bookable 0.318356425
## Bed.Type Bed.Type 0.192047755
## Host.SuperHost Host.SuperHost 0.043803663
## Washer Washer 0.034018054
## Host.verified Host.verified 0.031642412
## Kitchen Kitchen 0.029493105
## Breakfast Breakfast 0.004088264
## Host.ProfilePic Host.ProfilePic 0.000000000
## [1] "Root Mean Squared Error: 73.8598968588597"
We can see from the relative influence values shown above that the most relevant drivers of price are the same ones we found in the single-tree model. As with the Random Forest, the Root Mean Squared Error drops with respect to the single tree, here to 73.86.
We can conclude that, while these models gave us great help in interpreting our dataset, they still seem to slightly underperform. One reason could be that our dataset is really heterogeneous, being composed of cities that differ considerably even in their mean price level. Another could be that we still do not have all the variables needed to correctly construct a prediction model.
Some next steps to improve the performance of our model and perform a more complete analysis could be:
We now approach the problem with unsupervised learning models. This new approach can help us better understand the findings of the analysis above, but it also offers a different point of view on the dataset.
We start from the most famous unsupervised learning model, Principal Component Analysis (PCA), which can help us identify the principal components along which our data are spread out.
Before computing the model, we recall that PCA can only be applied to quantitative variables; this means that even if we expect the results of this model to be aligned with our previous findings, we cannot fully rely on it to completely analyse our dataset.
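A sketch of the PCA step on quantitative columns only (toy data with illustrative column names; scaling matters here because the variables live on very different scales):

```r
# PCA on quantitative variables only; toy stand-in for the real columns
set.seed(1)
quant <- data.frame(
  Price             = round(rexp(100, rate = 1 / 160)),
  Accommodates      = sample(1:6, 100, replace = TRUE),
  Beds              = sample(1:4, 100, replace = TRUE),
  Number.of.Reviews = rpois(100, lambda = 20)
)

pca <- prcomp(quant, scale. = TRUE)   # scale: variables have different units
summary(pca)                          # proportion of variance per component
```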
From the plots above we can see that there is clearly one dimension along which the data are spread out the most, and it is the one related to the actual property (Accommodates, Beds, Bathrooms and Bedrooms).
We then have a second and a third dimension, each capturing less than half of the variability of the first. They are, respectively, a value-of-property dimension, since its main components are Price, Rating and Host.Since (recall that we interpreted an experienced host as an added value), and what seems to be a popularity dimension, since its main component is the number of reviews.
This model is aligned with our first analysis, and it confirms that the variability of the dataset can be traced back to the groups of variables that we already presented as correlated.
Unfortunately we cannot see any clear cluster in the graph reporting the projected dimensions; this could easily be caused by the fact that all qualitative variables were left out of this model.
We show below that the k-means algorithm confirms our expectations: we are not able to identify clear clusters just by looking at quantitative variables.
As expected, the two clusters we get (which seem to try to separate small properties, with a low number of beds and a low Accommodates value, from large ones) are completely overlapping.
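The k-means step above can be sketched as follows (toy scaled data; k = 2 and the number of restarts are illustrative choices):

```r
# k-means with k = 2 on scaled quantitative variables; toy data
set.seed(1)
quant <- scale(data.frame(
  Beds         = sample(1:4, 100, replace = TRUE),
  Accommodates = sample(1:8, 100, replace = TRUE)
))

km <- kmeans(quant, centers = 2, nstart = 20)   # 20 random restarts
table(km$cluster)                               # size of each cluster
```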
Given what we said above, we will now try to compute different clusters using hierarchical clustering; in this way we are able to add qualitative variables to our analysis as well, hopefully improving the model.
In order to evaluate the model with both quantitative and qualitative variables, we use the Gower distance.
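A sketch of this step, using `cluster::daisy` for the Gower distance on toy mixed-type data and then standard hierarchical clustering:

```r
# Gower distance handles mixed quantitative/qualitative variables
library(cluster)

set.seed(1)
toy <- data.frame(
  Price     = round(rexp(60, rate = 1 / 160)),
  Room.Type = factor(sample(c("Entire home/apt", "Private room"), 60, replace = TRUE))
)

gower.d  <- daisy(toy, metric = "gower")    # pairwise dissimilarities
hc       <- hclust(as.dist(gower.d), method = "complete")
clusters <- cutree(hc, k = 6)               # cut into 6 clusters
table(clusters)                             # size of each cluster
```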
We start by computing 6 clusters (which we hope to map onto our 6 cities). In this analysis we remove the variable City, in order to understand whether our cities actually differ.
From the graphs above we can see that there is no actual difference: the 6 clusters seem to overlap in each of the cities.
We now try the same method with a smaller number of clusters, in order to understand if there are differences between properties that are not related to the city, but could perhaps be related to different locations within the cities.
As can be seen from the graph above, we did not get the results we expected: the clusters seem to overlap in all the cities, without any geographical meaning.
We now define different clustering models, one for each city, in order to understand if our data can be grouped in a way that has some geographical interpretation.
We notice that in some cities (like London and Berlin) there seems to be a slight geographical interpretation of the clusters, but they still overlap in most parts of the city.
This unsupervised approach definitely helped us in the first part, in identifying the principal components along which our data are spread out.
Unfortunately we did not obtain the results we expected from the clustering analysis; this could be due to the fact that we have many variables (like the host-related ones) that are not correlated with the construction of geographical clusters.
Some next steps for this kind of analysis would be: